Start date: 04-15-2020
End date: 04-17-2020
Analysed by: Ruth Chia
Working directory: /data/NDRS_LNG/MyastheniaGravis/updated.April2020/US
Run sample- and variant-level QC on raw genotype plink files.
Submit for imputation using Topmed Imputation Server (hg38).
Generate PCs and covariate file.
Run GLM (dosages) using imputed data
QC script is written in two parts i.e.
Sample-level QC includes:
STEP 1 : SAMPLE-LEVEL FILTERING (HETEROZYGOSITY, CALL RATE, GENDER CHECK, ANCESTRY)
STEP 1.1: Sample-level filtering (Heterozygosity check, removal of het outliers)
STEP 1.2: Sample-level filtering (sample call rate check, maximum missing calls per sample --mind 0.05)
STEP 1.3: Sample-level filtering (Gender check, remove samples that failed gender check)
STEP 1.4: Sample-level filtering (Ancestry check, remove samples that are not european)
STEP 1.5: Generate list of duplicates; remove duplicates
STEP 1.6: Generate list of related samples
Variant-level QC includes:
STEP 2 : VARIANT-LEVEL FILTERING (Variant inclusion criteria to account for genotyping batch differences; then make final bfiles with --geno 0.05)
STEP 2.1: Case/control nonrandom missingness test (i.e. exclude SNPs with P \< 1E-4)
STEP 2.2: Haplotype-based test for non-random missing genotype data (i.e. exclude SNPs with P \< 1E-4)
STEP 2.3: Make list of varaints that failed HWE (controls) (i.e. exclude SNPs with midP \< 1E-10)
STEP 2.4: Final --geno 0.05 to account for final variant missingness
%%bash
echo "sh ./scripts/QC_preimputation_v2_keepRelated_part1of2.mg.sh merged1 US_mg chiarp@mail.nih.gov" > qc1.swarm
swarm --file qc1.swarm --gres=lscratch:200 --logdir swarmOE_qc --partition quick \
-g 64 -t auto --time 4:00:00 --sbatch "--mail-type=BEGIN,FAIL,TIME_LIMIT_80"
Notes:
Previously submitted job for imputation but 275 samples failed imputation QC. made the list of 'bad samples' from statistic report generated and used this list remove bad samples and generated a set of 'good samples'.
List was created locally on my computer using find and replace to remove unwanted texts and copied to working directory on biowulf. File is: /data/chiarp/MG_GWAS/Discovery_US_keepRelated/badSamples_failedImputationQC_toRemove.txt
%%bash
# Copy scripts folder and 'badSamples' list to here
cp -r /data/NDRS_LNG/ALS_imputationHG19_withChrX_2020/scripts .
cp /data/chiarp/MG_GWAS/Discovery_US_keepRelated/badSamples_failedImputationQC_toRemove.txt .
%%bash
module load plink/1.9.0-beta4.4
mkdir CLEAN.rawGenotype.keepRelated
plink \
--bfile FILTERED.US_mg_keepRelated \
--remove badSamples_failedImputationQC_toRemove.txt \
--allow-no-sex \
--keep-allele-order \
--make-bed \
--out CLEAN.rawGenotype.keepRelated/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005
%%bash
module load plink/1.9.0-beta4.4
mkdir CLEAN.rawGenotype.UNRELATED
plink \
--bfile CLEAN.rawGenotype.keepRelated/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005 \
--remove FILTERED.to.remove.GRM_matrix.noDups.relatedSamples_FIDspaceIID.txt \
--extract hwe/Vars.passhwecontrols1e-10.geno005.snplist \
--keep-allele-order \
--allow-no-sex \
--make-bed \
--out CLEAN.rawGenotype.UNRELATED/FILTERED.US_mg_noDups.UNRELATED.hwe1e-10.geno005
%%bash
cp CLEAN.rawGenotype.keepRelated/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005.* .
chmod u+x ./scripts/QC_preimputation_v2_keepRelated_part2of2.sh
echo "./scripts/QC_preimputation_v2_keepRelated_part2of2.sh FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005 chiarp@mail.nih.gov" > QC_preimputation_part2.swarm
swarm --file QC_preimputation_part2.swarm --gres=lscratch:200 \
--module plink,R -g 32 --time 24:00:00 -t auto \
--sbatch "--mail-type=BEGIN,FAIL,END,TIME_LIMIT_80"
%%bash
# Remove some intermediate files that are not needed to save some space
rm FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005-chr*.check.*
rm FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005X.*
rm FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005[0-9].*
rm FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005[0-9][0-9].*
rm FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005-updated-chr*.*
rm *-FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005-HRC.txt
rm Run-plink.sh
rm FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005.*
To create covariate files for:
RELATED+UNRELATED samples UNRELATED samples onlyAdditional notes: Removed 464 dbGAP samples that cannot be included due to permission issues.
%%bash
DATA="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US"
grep "phs000372" $DATA/CLEAN.rawGenotype.keepRelated/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005.fam | awk '{print $1,$2}' > $DATA/SamplesToRemove.dbGAP.FIDspaceIID.txt
grep "phs000397-4433" $DATA/CLEAN.rawGenotype.keepRelated/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005.fam | awk '{print $1,$2}' >> $DATA/SamplesToRemove.dbGAP.FIDspaceIID.txt
grep "phs000428-13949" $DATA/CLEAN.rawGenotype.keepRelated/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005.fam | awk '{print $1,$2}' >> $DATA/SamplesToRemove.dbGAP.FIDspaceIID.txt
echo "Number of dbGAP samples to remove:"
wc -l $DATA/SamplesToRemove.dbGAP.FIDspaceIID.txt
head $DATA/SamplesToRemove.dbGAP.FIDspaceIID.txt | column -t
%%bash
module load plink/1.9.0-beta4.4
module load R/3.5.2
module load GCTA/1.92.0beta3
module load flashpca/2.0
mkdir PCAcovs
# Prune snps
plink \
--bfile CLEAN.rawGenotype.keepRelated/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005 \
--remove /data/NDRS_LNG/MyastheniaGravis/updated.April2020/US/SamplesToRemove.dbGAP.FIDspaceIID.txt \
--allow-no-sex \
--maf 0.01 \
--geno 0.01 \
--hwe 5e-6 \
--autosome \
--exclude range ./scripts/exclusion_regions_hg19.txt \
--make-bed \
--out PCAcovs/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005_filter \
--memory 119500 --threads 19
plink \
--bfile PCAcovs/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005_filter \
--allow-no-sex \
--geno 0.01 \
--maf 0.05 \
--indep-pairwise 1000 10 0.02 \
--out PCAcovs/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning \
--memory 119500 --threads 19
# All samples (related + unrelated samples; no dups)
plink \
--bfile PCAcovs/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005_filter \
--remove /data/NDRS_LNG/MyastheniaGravis/updated.April2020/US/SamplesToRemove.dbGAP.FIDspaceIID.txt \
--allow-no-sex \
--extract PCAcovs/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning.prune.in \
--keep-allele-order \
--make-bed \
--out PCAcovs/pruned.US_mg_noDups.keepRelated \
--memory 119500 --threads 19
# Unrelated samples only (no dups)
plink \
--bfile PCAcovs/pruned.US_mg_noDups.keepRelated \
--remove GRM/FILTERED.to.remove.GRM_matrix.noDups.relatedSamples_FIDspaceIID.txt \
--allow-no-sex \
--extract PCAcovs/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning.prune.in \
--keep-allele-order \
--make-bed \
--out PCAcovs/pruned.US_mg_noDups.UNRELATED \
--memory 119500 --threads 19;
# Calculate/generate PCs based on pruned data set
cd PCAcovs
flashpca --bfile pruned.US_mg_noDups.keepRelated -d 10 --suffix _US_mg_noDups.keepRelated_forPCA --numthreads 19
flashpca --bfile pruned.US_mg_noDups.UNRELATED -d 10 --suffix _US_mg_noDups.UNRELATED_forPCA --numthreads 19
# Move all log files to a new folder
mkdir logFiles
mv *.log logFiles
# Remove intermediate files
rm FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005_filter.bed
rm FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005_filter.bim
rm FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005_filter.fam
rm FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning.prune.in
rm FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005_filter_pruning.prune.out
Create covariate file
Phenotype file as input for here: /data/chiarp/MG_GWAS/Discovery_US_keepRelated/phenoINFO_forPCAcovs.txt. Then subset to only keep samples that made it to the 'CLEAN' plink1 files.
%%bash
module load R/3.5.2
R --vanilla --no-save
# Load libraries
require(data.table)
require(tidyverse)
# UNRELATED samples only (no dups)
## Read in files
pc <- fread("PCAcovs/pcs_US_mg_noDups.UNRELATED_forPCA", header=T)
fam <- fread("PCAcovs/pruned.US_mg_noDups.UNRELATED.fam", header=F)
fam <- fam %>% rename(FID = V1, IID = V2, MAT = V3, PAT = V4, SEX = V5, PHENO= V6)
pheno <- fread("/data/chiarp/MG_GWAS/Discovery_US_keepRelated/phenoINFO_forPCAcovs.txt", header=T)
pheno <- pheno %>% rename(genderONFILE = gender)
dim(pc)
dim(fam)
dim(pheno)
## Merge pc, fam and pheno files.
## Also create additional column so that the sample ID matches the format in the dose.vcf.gz (i.e. FID_IID)
data1 <- merge(fam,pheno,by="IID")
data2 <- merge(data1, pc, by=c("FID","IID"))
data2$GENDER <- data2$SEX - 1
data2$genderONFILE[is.na(data2$genderONFILE)] <- "NA"
data2$FID_IID <- paste(data2$FID,data2$IID, sep="_")
data2$age_at_onset <- as.numeric(data2$age_at_onset)
dim(data2)
write.table(data2, "COVARIATES.US_mg_noDups.UNRELATED.txt", sep="\t",quote=F, col.names=T,row.names=F)
data3 <- data2 %>%
mutate(FID = FID_IID,
IID = FID_IID) %>%
select(-FID_IID)
dim(data3)
head(data3)
write.table(data3, "COVARIATES.US_mg_noDups.UNRELATED.forImputed.txt", sep="\t",quote=F, col.names=T,row.names=F)
## Summarise numbers by pheno or diagnosis and gender or sex
## Note that there is a mismatch of gender from pheno file compared to the actual SEX information in the plink file.
## But since these samples survived gender check.
## Decided to stick with information in the .fam file
table(data2$PHENO, data2$SEX)
table(data2$diagnosis, data2$SEX)
table(data2$diagnosis, data2$genderONFILE)
table(data2$diagnosis, data2$GENDER)
head(data2)
## Run step() to determine which covariate to correct for for association analysis
data2$CASE <- data2$PHENO - 1
unrelated <- glm(CASE ~ GENDER + age_at_onset + PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 , data = data2, family = "binomial"(link = "logit"))
summary(unrelated)
step(unrelated)
UNRELATED samples only (no dups):
7896.68GENDER,age_at_onset,PC1,PC2,PC4,PC5,PC6,PC8,PC9,PC10%%bash
# Plot PCs to check to make sure cases and controls are not clustering out
module load R/3.5.2
Rscript ./scripts/PCplots_MG.GWAS.updated.R COVARIATES.US_mg_noDups.keepRelated.txt PCplots_US_mg_noDups.keepRelated
Rscript ./scripts/PCplots_MG.GWAS.updated.R COVARIATES.US_mg_noDups.UNRELATED.txt PCplots_US_mg_noDups.UNRELATED
from IPython.display import display
from PIL import Image
print("PC plots for samples and cases (UNRELATED samples only)")
PCplot="PCplots_US_mg_noDups.UNRELATED_prunedSNPs_hwe5e-6.geno001.maf001.jpg"
display(Image.open(PCplot))
%%bash
module load R/3.5.2
R --vanilla --no-save
# Load libraries
require(data.table)
require(tidyverse)
# read in covariate file with PC, age and sex info
data <- fread("COVARIATES.US_mg_noDups.keepRelated.txt",header=T)
data$age_at_onset[is.na(data$age_at_onset)] <- "NA"
tags <- c("Age.bin0","Age.bin1","Age.bin2","Age.bin3", "Age.bin4", "Age.bin5", "Age.bin6", "Age.bin7", "Age.bin8","Age.bin9")
vgroup <- as_tibble(data) %>%
mutate(tag = case_when(
age_at_onset == "NA" ~ tags[1],
age_at_onset <= 20 ~ tags[2],
age_at_onset > 20 & age_at_onset <= 30 ~ tags[3],
age_at_onset > 30 & age_at_onset <= 40 ~ tags[4],
age_at_onset > 40 & age_at_onset <= 50 ~ tags[5],
age_at_onset > 50 & age_at_onset <= 60 ~ tags[6],
age_at_onset > 60 & age_at_onset <= 70 ~ tags[7],
age_at_onset > 70 & age_at_onset <= 80 ~ tags[8],
age_at_onset > 80 & age_at_onset <= 90 ~ tags[9],
age_at_onset > 90 ~ tags[10]
))
# subset to gender and pheno groups
female.cases <- data.frame(subset(vgroup, vgroup$SEX == "2" & vgroup$PHENO == "2"))
female.controls <- data.frame(subset(vgroup, vgroup$SEX == "2" & vgroup$PHENO == "1"))
male.cases <- data.frame(subset(vgroup, vgroup$SEX == "1" & vgroup$PHENO == "2"))
male.controls <- data.frame(subset(vgroup, vgroup$SEX == "1" & vgroup$PHENO == "1"))
## Female cases
datalist = list()
for (i in 1:dim(table(female.cases$tag)))
{
temp1 <- subset(female.cases, female.cases$tag == names(table(female.cases$tag)[i]))
pc1.center <- mean(temp1$PC1)
pc2.center <- mean(temp1$PC2)
temp1$dist.toCenter <- sqrt((temp1$PC1-pc1.center)^2 + (temp1$PC2-pc2.center)^2)
temp1 <- temp1 %>% arrange(dist.toCenter)
temp1$group <- factor(rep(1:2, length.out = nrow(temp1)))
datalist[[i]] <- temp1
}
female.cases2 = do.call(rbind, datalist)
## Female controls
datalist = list()
for (i in 1:dim(table(female.controls$tag)))
{
temp1 <- subset(female.controls, female.controls$tag == names(table(female.controls$tag)[i]))
pc1.center <- mean(temp1$PC1)
pc2.center <- mean(temp1$PC2)
temp1$dist.toCenter <- sqrt((temp1$PC1-pc1.center)^2 + (temp1$PC2-pc2.center)^2)
temp1 <- temp1 %>% arrange(dist.toCenter)
temp1$group <- factor(rep(1:2, length.out = nrow(temp1)))
datalist[[i]] <- temp1
}
female.controls2 = do.call(rbind, datalist)
## Female cases
datalist = list()
for (i in 1:dim(table(male.cases$tag)))
{
temp1 <- subset(male.cases, male.cases$tag == names(table(male.cases$tag)[i]))
pc1.center <- mean(temp1$PC1)
pc2.center <- mean(temp1$PC2)
temp1$dist.toCenter <- sqrt((temp1$PC1-pc1.center)^2 + (temp1$PC2-pc2.center)^2)
temp1 <- temp1 %>% arrange(dist.toCenter)
temp1$group <- factor(rep(1:2, length.out = nrow(temp1)))
datalist[[i]] <- temp1
}
male.cases2 = do.call(rbind, datalist)
## Female controls
datalist = list()
for (i in 1:dim(table(male.controls$tag)))
{
temp1 <- subset(male.controls, male.controls$tag == names(table(male.controls$tag)[i]))
pc1.center <- mean(temp1$PC1)
pc2.center <- mean(temp1$PC2)
temp1$dist.toCenter <- sqrt((temp1$PC1-pc1.center)^2 + (temp1$PC2-pc2.center)^2)
temp1 <- temp1 %>% arrange(dist.toCenter)
temp1$group <- factor(rep(1:2, length.out = nrow(temp1)))
datalist[[i]] <- temp1
}
male.controls2 = do.call(rbind, datalist)
# Merge subsets abck to single data frame and split by 'Set'
data0 <- rbind(female.cases2,female.controls2,male.cases2,male.controls2)
Set1 <- subset(data0, data0$group == "1")
Set2 <- subset(data0, data0$group == "2")
table(Set1$SEX,Set1$PHENO)
table(Set2$SEX,Set2$PHENO)
table(Set1$tag,Set1$PHENO)
table(Set2$tag,Set2$PHENO)
table(Set1$tag,Set1$SEX)
table(Set2$tag,Set2$SEX)
write.table(Set1,"US_mg_noDups.keepRelated.subset1.txt", sep="\t",quote=F,row.names=F,col.names=T)
write.table(Set2,"US_mg_noDups.keepRelated.subset2.txt", sep="\t",quote=F,row.names=F,col.names=T)
write.table(Set1[,c("FID","IID")],"US_mg_noDups.keepRelated.subset1.FIDspaceIID.txt", sep=" ",quote=F,row.names=F,col.names=T)
write.table(Set2[,c("FID","IID")],"US_mg_noDups.keepRelated.subset2.FIDspaceIID.txt", sep=" ",quote=F,row.names=F,col.names=T)
%%bash
module load plink/1.9.0-beta4.4
mkdir CLEAN.rawGenotype.keepRelated.subset1
plink \
--bfile CLEAN.rawGenotype.keepRelated/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005 \
--remove badSamples_failedImputationQC_toRemove.txt \
--keep US_mg_noDups.keepRelated.subset1.FIDspaceIID.txt \
--allow-no-sex \
--keep-allele-order \
--make-bed \
--out CLEAN.rawGenotype.keepRelated.subset1/FILTERED.US_mg_noDups.keepRelated.subset1.hwe1e-10.geno005
%%bash
cp CLEAN.rawGenotype.keepRelated.subset1/FILTERED.US_mg_noDups.keepRelated.subset1.hwe1e-10.geno005.* .
chmod u+x ./scripts/QC_preimputation_v2_keepRelated_part2of2.updated.sh
echo "./scripts/QC_preimputation_v2_keepRelated_part2of2.updated.sh FILTERED.US_mg_noDups.keepRelated.subset1.hwe1e-10.geno005 chiarp@mail.nih.gov" > QC_preimputation_part2.subset1.swarm
swarm --file QC_preimputation_part2.subset1.swarm --gres=lscratch:200 \
--module plink,R -g 32 --time 24:00:00 -t auto \
--sbatch "--mail-type=BEGIN,FAIL,END,TIME_LIMIT_80"
%%bash
# Remove intermediate files
rm FILTERED.US_mg_noDups.keepRelated.subset1.hwe1e-10.geno005[0-9].log
rm FILTERED.US_mg_noDups.keepRelated.subset1.hwe1e-10.geno005[0-9][0-9].log
rm FILTERED.US_mg_noDups.keepRelated.subset1.hwe1e-10.geno005[0-9].vcf
rm FILTERED.US_mg_noDups.keepRelated.subset1.hwe1e-10.geno005[0-9][0-9].vcf
rm FILTERED.US_mg_noDups.keepRelated.subset1.hwe1e-10.geno005-updated-chr*.*
rm *-FILTERED.US_mg_noDups.keepRelated.subset1.hwe1e-10.geno005-HRC.txt
rm -r Imputation_HRC_bundle
rm Run-plink.sh
%%bash
module load plink/1.9.0-beta4.4
mkdir CLEAN.rawGenotype.UNRELATED.subset1
plink \
--bfile CLEAN.rawGenotype.keepRelated.subset1/FILTERED.US_mg_noDups.keepRelated.subset1.hwe1e-10.geno005 \
--remove FILTERED.to.remove.GRM_matrix.noDups.relatedSamples_FIDspaceIID.txt \
--keep-allele-order \
--allow-no-sex \
--make-bed \
--out CLEAN.rawGenotype.UNRELATED.subset1/FILTERED.US_mg_noDups.UNRELATED.subset1.hwe1e-10.geno005
%%bash
echo "sh scripts/MG.US.subset1.hg38_ImputationResultsDownload_and_plink_06-26-2020.sh" > MG.US.subset1.hg38_ImputationDownload.swarm
swarm --file MG.US.subset1.hg38_ImputationDownload.swarm \
--logdir swarmOE_ImputationDownload \
--gres=lscratch:200 \
--module plink/1.9.0-beta4.4 \
-g 2 -t auto --time 12:00:00 \
--sbatch "--mail-type=BEGIN,FAIL,TIME_LIMIT_80"
%%bash
DIR="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US"
cd $DIR/Imputation.subset1.hg38
for CHNUM in {1..22}
do
echo "tabix -p vcf chr${CHNUM}.dose.vcf.gz" >> tabix.swarm
done
swarm --file tabix.swarm \
--logdir swarmOE_ImputationDownload \
--gres=lscratch:200 \
--module samtools/1.11 \
-g 2 -t auto --time 12:00:00 \
--sbatch "--mail-type=BEGIN,FAIL,TIME_LIMIT_80"
%%bash
DIR="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US"
cd $DIR/Imputation.subset1.hg38
wc -l chr[0-9]*.info
%%bash
mkdir Imputation.subset1.hg38/Vars.Rsq03MAF00001
cd Imputation.subset1.hg38
for CHNUM in {1..22} X
do
awk '{if($5 > 0.0001 && $7 > 0.3) print $1}' chr${CHNUM}.info | tail -n +2 > Vars.Rsq03MAF00001/Rsq03MAF00001.chr${CHNUM}.txt
done
!wc -l Imputation.subset1.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr[0-9]*.txt
%%bash
module load plink/1.9.0-beta4.4
mkdir CLEAN.rawGenotype.keepRelated.subset2
plink \
--bfile CLEAN.rawGenotype.keepRelated/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005 \
--remove badSamples_failedImputationQC_toRemove.txt \
--keep US_mg_noDups.keepRelated.subset2.FIDspaceIID.txt \
--allow-no-sex \
--keep-allele-order \
--make-bed \
--out CLEAN.rawGenotype.keepRelated.subset2/FILTERED.US_mg_noDups.keepRelated.subset2.hwe1e-10.geno005
%%bash
cp CLEAN.rawGenotype.keepRelated.subset2/FILTERED.US_mg_noDups.keepRelated.subset2.hwe1e-10.geno005.* .
chmod u+x ./scripts/QC_preimputation_v2_keepRelated_part2of2.updated.sh
echo "./scripts/QC_preimputation_v2_keepRelated_part2of2.updated.sh FILTERED.US_mg_noDups.keepRelated.subset2.hwe1e-10.geno005 chiarp@mail.nih.gov" > QC_preimputation_part2.subset2.swarm
swarm --file QC_preimputation_part2.subset2.swarm --gres=lscratch:200 \
--module plink,R -g 32 --time 24:00:00 -t auto \
--sbatch "--mail-type=BEGIN,FAIL,END,TIME_LIMIT_80"
%%bash
# Remove intermediate files
rm FILTERED.US_mg_noDups.keepRelated.subset2.hwe1e-10.geno005-chr*.check.*
rm FILTERED.US_mg_noDups.keepRelated.subset2.hwe1e-10.geno005[0-9].log
rm FILTERED.US_mg_noDups.keepRelated.subset2.hwe1e-10.geno005[0-9][0-9].log
rm FILTERED.US_mg_noDups.keepRelated.subset2.hwe1e-10.geno005[0-9].vcf
rm FILTERED.US_mg_noDups.keepRelated.subset2.hwe1e-10.geno005[0-9][0-9].vcf
rm FILTERED.US_mg_noDups.keepRelated.subset2.hwe1e-10.geno005-updated-chr*.*
rm *-FILTERED.US_mg_noDups.keepRelated.subset2.hwe1e-10.geno005-HRC.txt
rm -r Imputation_HRC_bundle
rm Run-plink.sh
%%bash
module load plink/1.9.0-beta4.4
mkdir CLEAN.rawGenotype.UNRELATED.subset2
plink \
--bfile CLEAN.rawGenotype.keepRelated.subset2/FILTERED.US_mg_noDups.keepRelated.subset2.hwe1e-10.geno005 \
--remove FILTERED.to.remove.GRM_matrix.noDups.relatedSamples_FIDspaceIID.txt \
--keep-allele-order \
--allow-no-sex \
--make-bed \
--out CLEAN.rawGenotype.UNRELATED.subset2/FILTERED.US_mg_noDups.UNRELATED.subset2.hwe1e-10.geno005
%%bash
echo "sh scripts/MG.US.subset2.hg38_ImputationResultsDownload_and_plink_06-26-2020.sh" > MG.US.subset2.hg38_ImputationDownload.swarm
swarm --file MG.US.subset2.hg38_ImputationDownload.swarm \
--logdir swarmOE_ImputationDownload \
--gres=lscratch:200 \
--module plink/1.9.0-beta4.4 \
-g 2 -t auto --time 12:00:00 \
--sbatch "--mail-type=BEGIN,FAIL,TIME_LIMIT_80"
%%bash
DIR="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US"
cd $DIR/Imputation.subset2.hg38
for CHNUM in {1..22}
do
echo "tabix -p vcf chr${CHNUM}.dose.vcf.gz" >> tabix.swarm
done
swarm --file tabix.swarm \
--logdir swarmOE_ImputationDownload \
--gres=lscratch:200 \
--module samtools/1.11 \
-g 2 -t auto --time 12:00:00 \
--sbatch "--mail-type=BEGIN,FAIL,TIME_LIMIT_80"
%%bash
DIR="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US"
cd $DIR/Imputation.subset2.hg38
wc -l chr[0-9]*.info
%%bash
mkdir Imputation.subset2.hg38/Vars.Rsq03MAF00001
cd Imputation.subset2.hg38
for CHNUM in {1..22}
do
awk '{if($5 > 0.0001 && $7 > 0.3) print $1}' chr${CHNUM}.info | tail -n +2 > Vars.Rsq03MAF00001/Rsq03MAF00001.chr${CHNUM}.txt
done
!wc -l Imputation.subset2.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr[0-9]*.txt
%%bash
DATA="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US"
DIR="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US/Analysis.GLM.hg38/US.JointPostImputation"
mkdir $DIR/sharedVars.postImputation
cd $DIR/sharedVars.postImputation
for CHNUM in {1..22}
do
cat $DATA/Imputation.subset1.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr${CHNUM}.txt $DATA/Imputation.subset2.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr${CHNUM}.txt | sort | uniq -d > $DIR/SharedVars.postImputation.Rsq03MAF00001.chr${CHNUM}.txt
done
%%bash
DATA="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US"
DIR="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US/Analysis.GLM.hg38/US.JointPostImputation"
mkdir $DIR/merged.vcf
cd $DIR/merged.vcf
for CHNUM in {1..22};
do
echo "bcftools merge \
--output US.chr${CHNUM}.vcf.gz \
--output-type z \
-m id \
--info-rules R2:avg \
--threads 32 \
$DATA/Imputation.subset1.hg38/chr${CHNUM}.dose.vcf.gz $DATA/Imputation.subset2.hg38/chr${CHNUM}.dose.vcf.gz" >> merge.vcf.swarm
done
swarm --file merge.vcf.swarm --logdir swarmOE_merge.vcf \
--gres=lscratch:200 --time 12:00:00 -g 120 -t auto --module bcftools/1.9
%%bash
DATA="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US"
DIR="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US/Analysis.GLM.hg38/US.JointPostImputation"
grep "phs000372" $DATA/CLEAN.rawGenotype.keepRelated/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005.fam | awk '{print $1"_"$2,$1"_"$2}' > $DIR/SamplesToRemove.dbGAP.FID_IID.forImputed.txt
grep "phs000397-4433" $DATA/CLEAN.rawGenotype.keepRelated/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005.fam | awk '{print $1"_"$2,$1"_"$2}' >> $DIR/SamplesToRemove.dbGAP.FID_IID.forImputed.txt
grep "phs000428-13949" $DATA/CLEAN.rawGenotype.keepRelated/FILTERED.US_mg_noDups.keepRelated.hwe1e-10.geno005.fam | awk '{print $1"_"$2,$1"_"$2}' >> $DIR/SamplesToRemove.dbGAP.FID_IID.forImputed.txt
echo "Number of dbGAP samples to remove:"
wc -l $DIR/SamplesToRemove.dbGAP.FID_IID.forImputed.txt
head $DIR/SamplesToRemove.dbGAP.FID_IID.forImputed.txt | column -t
What is needed:
Location of files:
USmerged
/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US/Analysis.GLM.hg38/US.JointPostImputation/merged.vcf/US.chr${CHNUM}.vcf.gz/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US/Analysis.GLM.hg38/US.JointPostImputation/sharedVars.postImputation/SharedVars.postImputation.Rsq03MAF00001.chr${CHNUM}.txt/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US/COVARIATES.US_mg_noDups.UNRELATED.forImputed.txtItaly
/data/NDRS_LNG/MyastheniaGravis/updated.April2020/Itals/Imputation.hg38 /chr${CHNUM}.dose.vcf.gz/data/NDRS_LNG/MyastheniaGravis/updated.April2020/Itals/Imputation.hg38/Vars.Rsq03MAF00001/Rsq03MAF00001.chr${CHNUM}.txt/data/NDRS_LNG/MyastheniaGravis/updated.April2020/Itals/COVARIATES.Itals_mg_noDups.UNRELATED.forGLM.txtscript: vcfImputedToplink1.USmerged.sh
#!/bin/bash
# Warning message to indicate argument requirement for script to run
if [ $# -eq 2 ]
then
echo "vcfImputedToplink1.USmerged.sh running"
echo "This script should be executed in biowulf. If not, please terminate job."
else
echo "Need directory where file is in, CHNUM of imputed vcf.gz file"
echo "Note: This script is written to be executed in biowulf."
exit
fi
# Arguments to pass
DIR=$1
CHNUM=$2
# Load modules on biowulf
module load plink/1.9.0-beta4.4
DATA="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US/Analysis.GLM.hg38/US.JointPostImputation"
plink \
--vcf $DATA/merged.vcf/US.chr${CHNUM}.vcf.gz \
--double-id \
--extract $DATA/sharedVars.postImputation/SharedVars.postImputation.Rsq03MAF00001.chr${CHNUM}.txt \
--remove /data/NDRS_LNG/MyastheniaGravis/updated.April2020/US/SamplesToRemove.dbGAP.FIDspaceIID.txt \
--pheno-name PHENO \
--pheno $DIR/COVARIATES.US_mg_noDups.UNRELATED.forImputed.txt \
--keep $DIR/Imputation.USmerged.hg38/SampleList.USmerged.UNRELATED.forImputed.txt \
--keep-allele-order \
--allow-no-sex \
--threads 32 \
--make-bed \
--out plinkForPRS/USmerged.UNRELATED.Imputed.pass.Rsq03MAF00001.chr${CHNUM}
%%bash
mkdir Imputation.USmerged.hg38
mkdir Imputation.USmerged.hg38/plinkForPRS
%%bash
cd Imputation.USmerged.hg38
DIR="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US"
# Make list of unrelated samples to keep
awk '{print $1,$2}' $DIR/COVARIATES.US_mg_noDups.UNRELATED.forImputed.txt > $DIR/Imputation.USmerged.hg38/SampleList.USmerged.UNRELATED.forImputed.txt
DATA="/data/NDRS_LNG/MyastheniaGravis/updated.April2020/US/Analysis.GLM.hg38/US.JointPostImputation"
for CHNUM in {1..22}
do
echo "sh vcfImputedToplink1.USmerged.sh $DIR $CHNUM" >> generatePRSplink.swarm
done
%%bash
cd Imputation.USmerged.hg38
swarm --file generatePRSplink.swarm --logdir swarmOE_vcfToPlink -g 120 --time 12:00:00 -t auto --module plink/1.9.0-beta4.4